[LI] Add feature for Spark ORC reader to ignore field ids in files by using a new table property #134
….ids pro… (linkedin#122)" This reverts commit 21c3a80.
… using a new table property
@rzhang10 is it possible to add unit tests here?
No, I feel it's quite hard to do, because inside the Iceberg codebase it's hard to create a Hive table backed by Iceberg files (for that I would need to repeat, in the unit test, all the custom logic Gobblin currently does). I think we can rely on integration testing on the cluster.
Would it make sense to
If this is far too complex then we can probably leave it out. Your call, @rzhang10.
Adds a new table property, "read.orc.ignore.field-ids.enabled", that makes the Spark ORC reader ignore field IDs in the file schema even when the file contains them. This feature is useful for LI-Iceberg when reading Gobblin dual Hive/Iceberg tables that share Iceberg-written files.

Integration tested via spark-shell on the cluster. Setting the table property with
ALTER TABLE xxx.xxx SET TBLPROPERTIES ('read.orc.ignore.field-ids.enabled' = 'true');
makes the table readable.
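
For readers who want a concrete picture of the toggle, here is a minimal sketch of how a reader could consult the property before deciding whether to trust the field IDs stored in a file. It is not the code in this PR: the class name, the method names, the mode labels, and the default of false are all assumptions made for illustration; only the property name comes from the description above.

```java
import java.util.Collections;
import java.util.Map;

public class IgnoreFieldIdsSketch {
  // Property name as given in the PR description.
  static final String ORC_IGNORE_FIELD_IDS = "read.orc.ignore.field-ids.enabled";
  // Assumed default: keep using the file's field IDs unless the table opts out.
  static final boolean ORC_IGNORE_FIELD_IDS_DEFAULT = false;

  // Hypothetical helper: parse the table property, falling back to the default.
  static boolean shouldIgnoreFieldIds(Map<String, String> tableProperties) {
    String value = tableProperties.get(ORC_IGNORE_FIELD_IDS);
    if (value == null) {
      return ORC_IGNORE_FIELD_IDS_DEFAULT;
    }
    return Boolean.parseBoolean(value.trim());
  }

  // Hypothetical label for the reader's choice; a real reader would branch
  // on the flag when building its read schema instead of returning a string.
  static String fieldIdMode(Map<String, String> tableProperties) {
    return shouldIgnoreFieldIds(tableProperties)
        ? "ignore-file-field-ids"
        : "use-file-field-ids";
  }

  public static void main(String[] args) {
    Map<String, String> unset = Collections.emptyMap();
    Map<String, String> enabled = Collections.singletonMap(ORC_IGNORE_FIELD_IDS, "true");
    System.out.println(fieldIdMode(unset));
    System.out.println(fieldIdMode(enabled));
  }
}
```

Keeping the default at false in this sketch means tables that never set the property behave as before, which matches the opt-in nature of the ALTER TABLE statement used in the integration test.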